
fix: software value failing for large repos [CM-1029] #3947

Merged
mbani01 merged 10 commits into main from fix/software_value_failure on Mar 25, 2026


mbani01 commented Mar 23, 2026

This pull request adds support for handling very large repositories in the software value calculation service. The main change is the introduction of a --no-large flag that, when enabled, skips files larger than 100MB to prevent out-of-memory errors during analysis. The Python service now automatically enables this flag for repositories larger than 10GB, improving reliability for large codebases. Several functions in the Go codebase are updated to propagate and handle this flag.

Large repository handling:

  • Added a --no-large command-line flag to the Go binary (main.go) to skip files larger than 100MB when analyzing repositories, preventing OOM errors on large repos. This flag is propagated through all relevant functions and passed to the scc tool.
  • In the Python service (software_value_service.py), added logic to check the repository size before running the Go binary. If the repo is larger than 10GB, the --no-large flag is automatically added to the command invocation.
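As an illustration of the first bullet, here is a minimal sketch of how a boolean --no-large flag could be parsed with Go's standard flag package. It is not the code from main.go: the options struct, the parseArgs helper, the usage string, and the rule that exactly one repository path follows the flags are all assumptions made for the example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the parsed command line. The struct and its field names are
// hypothetical; the PR only confirms that a --no-large flag exists.
type options struct {
	noLarge  bool
	repoPath string
}

// parseArgs parses the arguments (without the program name). It assumes the
// binary takes exactly one positional argument, the repository path.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("software-value", flag.ContinueOnError)
	fs.Usage = func() {
		// Hypothetical usage text, not the message from the PR.
		fmt.Fprintln(os.Stderr, "usage: software-value [--no-large] <repo-path>")
	}
	noLarge := fs.Bool("no-large", false, "skip files larger than 100MB")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if fs.NArg() != 1 {
		return options{}, fmt.Errorf("expected one repository path, got %d arguments", fs.NArg())
	}
	return options{noLarge: *noLarge, repoPath: fs.Arg(0)}, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("repo=%s noLarge=%t\n", opts.repoPath, opts.noLarge)
}
```

The flag package accepts both -no-large and --no-large, and it stops parsing at the first non-flag argument, so in this sketch the flag has to come before the repository path.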

Code robustness and clarity:

  • Improved error handling and messaging in several places, including fixing a typo in an error message and updating usage instructions to reflect the new flag.
  • Added a helper function in Python to determine repository size using du -sb.
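The size check itself lives in the Python service. The sketch below only restates the decision in Go, to match the Go snippet quoted later in this conversation. The function names are hypothetical, and reading 10GB as 10 GiB is an assumption, since the PR does not say whether it uses binary or decimal units.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// largeRepoThresholdBytes reads "10GB" as 10 GiB. This is an assumption; the
// PR does not state which unit convention the Python service uses.
const largeRepoThresholdBytes int64 = 10 * 1024 * 1024 * 1024

// parseDuBytes extracts the byte count from the output of `du -sb <path>`,
// which has the form "<bytes>\t<path>\n".
func parseDuBytes(output string) (int64, error) {
	fields := strings.Fields(output)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty du output")
	}
	return strconv.ParseInt(fields[0], 10, 64)
}

// extraFlags returns the flags to append to the binary invocation for a
// repository of the given size (hypothetical helper).
func extraFlags(sizeBytes int64) []string {
	if sizeBytes >= largeRepoThresholdBytes {
		return []string{"--no-large"}
	}
	return nil
}

func main() {
	// Sample du output for a made-up 12 GiB repository.
	size, err := parseDuBytes("12884901888\t/data/repos/example\n")
	if err != nil {
		fmt.Println("could not parse du output:", err)
		return
	}
	fmt.Println(size, extraFlags(size))
}
```

Note that -b is a GNU du option (apparent size in bytes), so a helper that shells out to du -sb depends on GNU coreutils being available where the worker runs.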

Note

Medium Risk
Changes the software value pipeline to conditionally skip large files and bypass analysis for specific repos, which can alter reported metrics and relies on new size-detection shelling out (du).

Overview
Improves software value analysis robustness for very large repositories by adding a --no-large flag to the Go software-value binary and propagating it through SCC execution (skipping files >100MB via scc --no-large --large-byte-count 100000000).

Updates the Python SoftwareValueService to (1) skip a hardcoded excluded repo ID entirely and (2) automatically enable --no-large when repo disk usage (via du -sb) is >= 10GB, plus minor usage/error-message cleanup in the Go binary.

Written by Cursor Bugbot for commit 85a3030. This will update automatically on new commits.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

mbani01 added 6 commits March 23, 2026 16:30
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 changed the title from "chore: allow dry-run for non-prod envs" to "fix: software value failing for large repos [CM-1029]" Mar 24, 2026
@mbani01 mbani01 marked this pull request as ready for review March 24, 2026 17:26
Copilot AI review requested due to automatic review settings March 24, 2026 17:26

Copilot AI left a comment


Pull request overview

This PR improves reliability of the software value calculation for very large repositories by adding an option to skip very large files during SCC analysis, and enabling that option automatically for large repos in the Python worker.

Changes:

  • Added a --no-large CLI flag to the Go software-value binary and propagated it through SCC invocations.
  • Updated Python SoftwareValueService to compute repository disk usage and automatically add --no-large for repos ≥ 10GB.
  • Minor robustness/clarity improvements (usage text, error message cleanup).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| services/apps/git_integration/src/crowdgit/services/software_value/software_value_service.py | Adds repo-size detection and conditionally appends --no-large to the binary invocation. |
| services/apps/git_integration/src/crowdgit/services/software_value/main.go | Introduces --no-large flag and passes it through to SCC execution (including large-file threshold args). |


@mbani01 mbani01 self-assigned this Mar 25, 2026
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 force-pushed the fix/software_value_failure branch from 6685eac to b5f1207 Compare March 25, 2026 09:43
@mbani01 mbani01 merged commit 1b1ecbd into main Mar 25, 2026
14 checks passed
@mbani01 mbani01 deleted the fix/software_value_failure branch March 25, 2026 09:45
@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


func runSCC(sccPathPath string, noLarge bool, args ...string) (string, error) {
	var cmdArgs []string
	if noLarge {
		cmdArgs = append(cmdArgs, "--no-large", "--large-byte-count", "100000000")

Missing --large-line-count override causes unintended file exclusion

Medium Severity

When --no-large is enabled, scc filters files exceeding either the byte count or the line count threshold. The code sets --large-byte-count to 100000000 but does not set --large-line-count, so scc's default of 40000 lines applies. This means source files with more than 40000 lines (but well under 100MB) are silently excluded from the COCOMO cost calculation, understating the software value — even though the stated intent is only to skip files larger than 100MB.
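One way to address this, sketched under assumptions rather than taken from the PR, is to pass --large-line-count alongside --large-byte-count so that the byte limit becomes the only effective filter. The sccArgs helper and the line-count value below are hypothetical. The value is simply chosen to be larger than the number of lines a 100,000,000-byte file can hold.

```go
package main

import "fmt"

const (
	// largeByteCount is the 100MB threshold passed to scc in the PR.
	largeByteCount = 100_000_000
	// largeLineCount is an assumed override, not part of the PR. A file
	// below the byte limit cannot have this many lines, so the line-count
	// filter never excludes a file that the byte filter would keep.
	largeLineCount = 1_000_000_000
)

// sccArgs builds the argument list for scc (hypothetical helper). When
// noLarge is set it prepends the large-file filter flags, overriding both
// thresholds instead of relying on scc's default line count.
func sccArgs(noLarge bool, args ...string) []string {
	var cmdArgs []string
	if noLarge {
		cmdArgs = append(cmdArgs,
			"--no-large",
			"--large-byte-count", fmt.Sprint(largeByteCount),
			"--large-line-count", fmt.Sprint(largeLineCount),
		)
	}
	return append(cmdArgs, args...)
}

func main() {
	// Placeholder repository path for illustration.
	fmt.Println(sccArgs(true, "/path/to/repo"))
}
```

Whether very long generated files should count toward the COCOMO estimate at all is a separate policy question. The sketch only makes the behavior match the stated intent of skipping files larger than 100MB.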


skwowet pushed a commit that referenced this pull request Mar 25, 2026
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>